This Readme describes analog3.1. For the latest version of analog, see the analog home page. For examples of the output see
Analog is freeware, but its use is covered by a licence. You must agree to the terms of the licence before using the program.
This is a version of the Readme on one page. If you're reading it online, you might prefer the version on several smaller pages. There is an index at the end of this document.
If you log in to your ISP's machine from your home machine, you have two options. If you have the right permissions, you can run analog on your ISP's machine. Otherwise, you can download (e.g., ftp) the logfiles from their machine to yours, and then run analog on your machine.
Once you've downloaded the right version of analog for your computer from the analog home page (or a mirror site), you need to know how to set it up and run it. This is very easy, but the instructions are slightly different depending on which platform you're using.
LOGFILE logfilename # to set where your logfile lives
The logfile must live on your local disk -- analog doesn't fetch it from across the network. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog.
One note: on other platforms, there is another way to give options, via command line arguments. You'll see these mentioned in this Readme from time to time, but the Mac doesn't have a command line, so ignore these.
If you want to compile your own version of analog (it's written in C), or just to read the source code, it's available from the analog home page. (It's the same source code for all versions).
There are two ways of running analog. You can either run it from Windows by double-clicking on its icon, or you can run it from the DOS command prompt (under Start-Programs). If you run it from Windows, it will create a DOS window to run in. When it's finished, it will produce an output file called Report.html.
LOGFILE logfilename # to set where your logfile lives
The logfile must live on your local disk -- analog doesn't fetch it from across the network. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog.
In some ways, it's easier to run analog from the DOS command prompt, because you get to see any error or warning messages more easily. Also, if you run analog from the command prompt, there is another way to give options, via command line arguments, given on the command line after the program name. These are just shortcuts for configuration file commands.
If you want to compile your own version of analog (it's written in C), or just to read the source code, it's available from the analog home page. (It's the same source code for all versions).
LOGFILE logfilename # to set where your logfile lives
You need to use \ not / as the directory separator in the logfile name. The logfile must live on your local disk -- analog doesn't fetch it from across the network. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog.
There is one other way to give options to analog, via command line arguments, given on the command line after the program name. These are just shortcuts for configuration file commands.
If you want to compile your own version of analog (it's written in C), or just to read the source code, it's available from the analog home page. (It's the same source code for all versions). There are instructions about compiling on another page.
First, you will want to look at the file analhead.h. It consists of user-settable options, most of which you can override later. You will probably want to check the first few options in the file, but you can even leave most of them until later.
When you have done that, you need to compile the program. How to do that depends on which system you're using.
make
to compile the program. On most systems, that will be sufficient. If it fails to compile, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again. It says in that file what to do. In particular, Solaris 2 (SunOS 5) users need to change the LIBS= line (and may need to change the DEFS= line -- see below).
If you haven't got gcc, you will need to change the compiler - try acc or cc instead. If it still doesn't compile, try DEFS=-DNODNS to ignore the DNS lookup code.
There is a known problem with HP-UX 10 and some versions of gcc. If it complains about an error in the <sys/stat.h> library, you need to upgrade to gcc version 2.7.2.3 or later, or use HP's cc compiler. HP's compiler is not an ANSI C compiler by default, so you need to specify -Ae in the CFLAGS to tell the compiler to use ANSI C.
SunOS 4's cc doesn't seem to have the necessary header files for ANSI C. Often gcc doesn't work either -- you will probably need to use acc.
SunOS 5 sometimes seems to have a broken strcmp() function. If you get an "illegal instruction" error when running analog, compile it with the -DNOSTRCMP in the DEFS= line.
Compiling under VMS. Type
MMS
to compile analog. Under VMS 7.0 & 7.1, there is a VMS bug that stops analog compiling. The fix is to add "/define=(_VMS_V6_SOURCE)" to the cflags definitions at the top of the file descrip.mms.
Compiling under Acorn RiscOS. The Makefile is called Make.Risc, and you will have to rename it to Makefile before running make. Also you have to make directories called C, H and O, and move the source files into the appropriate directories: e.g., alias.c must be renamed C.alias. And you will find that there are some filenames in the header file analhead.h that you want to change to fit into the RiscOS directory structure.
Compiling under OS/2. Although there is a precompiled version of analog for OS/2, if you want to compile your own you will need the EMX package. You should edit the Makefile to have OS=OS2 and LIBS=-lsocket. Then after running make, you need to run the command
EMXBIND -b ANALOG
to generate the analog.exe executable.
analog
to run the program. (Or ./analog if for some reason . isn't in your $PATH.)
You can configure analog by putting commands in the configuration file, which is called analog.cfg by default. Two commands you will need straight away are
LOGFILE logfilename     # to set where your logfile lives
OUTFILE outputfile.html # to send the output to a file instead of the screen
The logfile must live on your local disk -- analog doesn't fetch it from across the network. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog.
There is one other way to give options to analog, via command line arguments, given on the command line after the program name. These are just shortcuts for configuration file commands.
The following section is a technical (i.e., dull but important) section on the
Then there's documentation on all the configuration commands in the following categories. Analog has over 200 configuration commands, as well as several command line options, so sometimes these sections turn into lists of commands. But here's where you find out everything you can do with analog. There's also an index of all the commands and topics on a separate page.
Analog reads logfiles produced by your web server, and produces an output file based on the data in them. So you need to know how to specify which logfile to read, and which file to send the output to. The relevant commands look like
LOGFILE my_logfile
OUTFILE output.html
where, of course, you should substitute the names of the files you want to use. The logfile must be on your local disk -- analog doesn't fetch it from across the network, so if it's not on your local disk, you will have to fetch it yourself first. You can read several logfiles by giving several LOGFILE commands, or by giving a comma-separated list, or by using wild cards in the logfile name. So, for example, if you use the commands
LOGFILE new1.log,old*.log
LOGFILE new2.log
analog will analyse the logfiles new1.log, new2.log, and all the old logfiles. Analog will recognise logfiles in several different formats. You can read more about this in the section Choosing a logfile.
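As a rough illustration of how several LOGFILE commands, comma-separated lists and wild cards combine, here is a small Python sketch. It is not analog's own code (analog is written in C); the function name expand_logfiles and the list of available file names are made up for the example, and Python's fnmatch stands in for whatever wildcard matching analog really uses.

```python
from fnmatch import fnmatchcase

def expand_logfiles(specs, available):
    """Expand LOGFILE-style arguments against a list of known file names.

    specs     -- one string per LOGFILE command (hypothetical input)
    available -- file names present on the local disk (hypothetical input)
    """
    selected = []
    for spec in specs:                   # LOGFILE commands accumulate
        for pattern in spec.split(","):  # comma-separated list, no spaces
            for name in available:
                if fnmatchcase(name, pattern) and name not in selected:
                    selected.append(name)
    return selected

files = ["new1.log", "new2.log", "old1.log", "old2.log", "notes.txt"]
print(expand_logfiles(["new1.log,old*.log", "new2.log"], files))
```

With the two commands from the example above, the sketch selects new1.log, the old logfiles and new2.log, and leaves notes.txt alone.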
HOSTNAME "Spam Widgets Inc."
HOSTURL www.spam-widgets.com
If you have broken images in the output instead of graphs, you need to say in which directory on your server the images are stored. You do this by a command like
IMAGEDIR /analog/images/
(The images are distributed with the program - you will have to move them to whichever directory you choose.)
MONTHLY ON    # one line for each month
WEEKLY ON     # one line for each week
FULLDAILY ON  # one line for each day
DAILY ON      # one line for each day of the week
HOURLY ON     # one line for each hour of the day
GENERAL ON    # the General Summary at the top
REQUEST ON    # which files were requested
FAILURE ON    # which files were not found
DIRECTORY ON  # directory report
HOST ON       # which computers requested files
DOMAIN ON     # which countries they were in
REFERRER ON   # where people followed links from
FAILREF ON    # where people followed broken links from
BROWSER ON    # which browsers people were using
FILETYPE ON   # types of file requested
SIZE ON       # sizes of files requested
The referrer and browser reports will only appear if your server records the necessary information. You can configure lots of other things about each report, such as how many rows are listed, which columns are included, and how the reports are sorted. For example, the command
REQINCLUDE pages
tells analog only to list pages, rather than all files, in the request report. You can read about all the options in the sections on Time reports, Other reports and Hierarchical reports.
LANGUAGE FRENCH
will give you the output in French. The available languages at the moment are CHINESE, CZECH, DANISH, DUTCH, ENGLISH, US-ENGLISH, FINNISH, FRENCH, GERMAN, GREEK, HUNGARIAN, ITALIAN, NORWEGIAN (Bokmål), NYNORSK, POLISH, PORTUGUESE, BR-PORTUGUESE, ROMANIAN, RUSSIAN, SLOVAK, SLOVENE, SPANISH, SWEDISH and TURKISH. See the section on Configuring the output for how to download, or even translate, new languages.
As I said, these are only a few of the commands available. To find out about all the commands, you'll have to read the remaining sections of the Readme, starting with a short section on the syntax of configuration commands.
CONFIGFILE other.cfg
The commands in the other configuration file are read immediately, in order. The program then continues reading the command line or calling configuration file where it left off. Note that reading an alternative configuration file does not stop the default configuration file being read as well. To do that you have to specify -G as well as the +g command.
In the Mac version, you can start up a program with a particular configuration file by dragging it onto the analog icon. The configuration file must start with a #. The default configuration file is still read first.
You can also specify any configuration command on the command line even if it doesn't have a command line abbreviation, by use of the +C command. For example, +C"UNCOMPRESS *.gz" will include that command.
DAILY OFF                    # We don't want a daily summary
FULLDAILY "ON"               # We want a full daily report instead
HOSTNAME (Spam Widgets Inc.) # Spaces, so quotes or brackets needed
Generally later commands override earlier ones if there is a conflict (e.g., for the OUTFILE, because you can have only one), or supplement them if there is no conflict (e.g., for the LOGFILE, because you can read several logfiles).
analog -settings [other options]
or include PRINTVARS ON in the configuration commands. That will tell you what the values of all the variables will be, based on the defaults in analhead.h, the configuration commands, and the command line options. If you're on Unix or Windows, remember that you can send the output to a file with
analog -settings > file
LOGFILE logfilename
or just to put the logfile name on the command line without any arguments, e.g., analog logfilename. A - sign or the word stdin is interpreted as standard input: this is useful on Unix systems for constructing pipes. The word none means that the list of logfiles specified so far is erased. All logfiles must be on your local disk -- analog doesn't fetch them from across the network. In the Mac version, you can also analyse a particular single logfile by dragging it onto the analog icon.
You can have several LOGFILE commands. You can include wildcards in the logfile name (but not necessarily in the directory name: this is system-dependent), and you can use a list of logfiles separated by commas (without spaces). So the following commands would tell analog to read logfile1, c:\logs\logfile2, and all files ending in .log:
LOGFILE logfile1,*.log
LOGFILE c:\logs\logfile2
The LOGFILE commands are cumulative, except that any logfiles on the command line or in user-specified configuration files override any in the default configuration file, and are themselves overridden by any in the mandatory configuration file.
The reason for the "sometimes" in the previous paragraph is as follows. The Microsoft and Netpresenz formats are extremely badly designed in that the date can occur in either of the forms date/month/year or month/date/year, and they don't say which they're using. Analog will detect them automatically if it can tell which date format is being used (e.g., 13/2/98 or 2/13/98), but if it can't, it will tell you to use one of the LOGFORMAT strings below. Also, the NCSA browser log can only be detected if it includes the date.
When you start up analog, all logfiles have the default logfile format. This is normally automatic detection, as explained above, but you can change it if your logfiles are always in a format which analog doesn't know about. You do this by means of the command
DEFAULTLOGFORMAT format
-- we'll discuss what the formats can be in a minute.
Sometimes you might want to analyse several logfiles with different formats. For this you need the LOGFORMAT command. This command only applies to future logfiles in the same configuration file. So if you change the format with a command like
LOGFORMAT format
then any logfiles you select with a LOGFILE command later in the same configuration file will get the new format. If you put the LOGFORMAT after the LOGFILE command, it will not take effect for that logfile, and you will most likely get a "can't auto-detect format" warning.
The possible formats for use with the DEFAULTLOGFORMAT and LOGFORMAT commands are of two types. First there are some symbolic words, and then there are log format strings. We'll look at the words first.
There are format words for all the built-in formats analog knows about. For example, COMMON will select common format; you can also have COMBINED, REFERRER, BROWSER, EXTENDED, MICROSOFT-NA (North American date format), MICROSOFT-INT (international date format), MS-EXTENDED (Microsoft's attempt at extended format), NETSCAPE, WEBSTAR, NETPRESENZ-NA (North American) or NETPRESENZ-INT (international). There are also the words AUTO for automatic detection and DEFAULT for whatever the default log format is.
If your logfile is not in one of the recognised formats, you can tell analog about your format using a log format string. You only ever need this if your logfile has lines which are not in one of the standard formats. The format string consists of a template for the logfile line, with the various fields and special characters replaced by codes as follows.
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
can be represented by the LOGFORMAT command
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
including two items, host and file. (The parentheses are needed because the argument contains spaces. Note also the use of %j to ignore two fields, the seconds and the timezone.)
Logfiles often contain lines in several different formats, so you can specify several log formats one after the other and they will accumulate. For example, the definition of common format should also include the line
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
to handle lines where the HTTP/1.0 part of the request is absent. Or you might use
LOGFORMAT COMMON
LOGFORMAT COMBINED
to represent a logfile which had lines in both those formats. Analog tries to match the line to the first format first, then if that fails the next, and so on, so the order of the formats is important. Usually you want to specify the most common one first, to minimise the time spent trying to match lines to inappropriate formats. The DEFAULTLOGFORMAT also accumulates in this way.
The log formats which analog can handle are those which are known as instantaneously decipherable: this means that the character which terminates a string can never occur in the string. In the above example, if the hostname ever contained a space, the line would be marked as corrupt, because analog terminates the host at the first space, not at the first occurrence of space-dash-space, and then the rest of the line wouldn't match. Of course, hostnames should never contain spaces, so this shouldn't be a problem. There are a couple of other restrictions: if there is any date or time information, then the year, month, date, hour and minute must all be present: and the same information may not occur twice in the format (so you can't have both %m and %M, for example).
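To make the matching order and the "first space ends the host" rule concrete, here is a Python sketch that mimics the two common-format LOGFORMAT strings above with regular expressions. The regular expressions and the function name parse_line are my own approximation for illustration; analog does not use regular expressions internally as far as this Readme says, and the field names are just labels for the codes %S, %u, %r, %c and %b.

```python
import re

# Regex stand-ins (an assumption, not analog's code) for
#   (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
#   (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
HEAD = (r'(?P<host>\S+) - (?P<user>\S+) '
        r'\[(?P<day>\d+)/(?P<month>\w+)/(?P<year>\d+)'
        r':(?P<hour>\d+):(?P<minute>\d+):[^\]]*\] ')
TAIL = r' (?P<code>\d+) (?P<bytes>\S+)'
FORMATS = [
    re.compile(HEAD + r'"\S+ (?P<file>[^"\s]+) \S+"' + TAIL),  # with HTTP/1.0
    re.compile(HEAD + r'"\S+ (?P<file>[^"\s]+)"' + TAIL),      # without it
]

def parse_line(line):
    """Try each format in order; return the fields of the first that fits."""
    for fmt in FORMATS:
        match = fmt.fullmatch(line)
        if match:
            return match.groupdict()
    return None  # no format matched: the line would count as corrupt

line = ('jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] '
        '"GET /~sret1/ HTTP/1.0" 200 1243')
print(parse_line(line))
```

A line whose hostname contained a space would fail both patterns here, which corresponds to the corrupt-line case described above.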
Sometimes you need to read one of the fields in a logfile, but not analyse it. For example, if you have a separate common log and referrer log, the referrer log might look like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
But the requests for /~sret1/analog/ would already have been counted when reading the main logfile, so you don't want to count them again now. You get round this by specifying a * in that item in the format string, like this:
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
Any of the seven items can be treated in this way.
Here are the exact rules about which logfile gets which log formats. The default logfile format starts off at AUTO. You can change it with a DEFAULTLOGFORMAT command, and then the default format accumulates unless you specify DEFAULTLOGFORMAT AUTO to return to automatic detection.
The current logfile format starts off at DEFAULT. You can change it with a LOGFORMAT command, and then the current format accumulates until a LOGFILE command intervenes; then it restarts at the next LOGFORMAT command. It also restarts if you specify LOGFORMAT AUTO or LOGFORMAT DEFAULT; or when the current format is reset to DEFAULT automatically, which happens at the end of the command line, and of every configuration file, and whenever a LOGFILE none command is encountered.
The default logfile selected at compilation time always gets the default format (although exactly what the default format is can still be changed with a DEFAULTLOGFORMAT command). Any logfile declared later, in a configuration file for example, gets the current log format at the time it is selected. If you specify several logfiles, they will all use the same format, unless there's a LOGFORMAT command or an implicit return to DEFAULT format between them.
LOGFILE log1,log2 http://www.%v.mydomain.com
would translate a filename /file.html with virtual host spam in log1 or log2 to http://www.spam.mydomain.com/file.html. If you are using the second argument to the LOGFILE command, you will probably want to use the SUBDIR command as well.
If %v is included in the argument and the logfile line doesn't have a virtual host, that line will be marked as corrupt. If VHOSTLOWMEM 3 is specified, the %v's will not be translated and will just appear as %v in the output.
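The translation just described can be sketched in a few lines of Python. This is only a model of the behaviour stated above, not analog's implementation; the function name translate is invented, and returning None is simply my way of representing a line that would be marked as corrupt.

```python
def translate(prefix, filename, vhost):
    """Prepend the LOGFILE prefix to a filename, filling in %v.

    Returns None when the prefix needs a virtual host and the logfile
    line did not supply one (the "corrupt line" case).
    """
    if "%v" in prefix:
        if not vhost:
            return None
        prefix = prefix.replace("%v", vhost)
    return prefix + filename

print(translate("http://www.%v.mydomain.com", "/file.html", "spam"))
```

The VHOSTLOWMEM 3 case, where %v is left untranslated, is deliberately not modelled here.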
LOGTIMEOFFSET -300
LOGFILE summer*.log
LOGTIMEOFFSET -360
LOGFILE winter*.log
While we're on the subject of time offsets, there is one other similar command, which is not directly to do with logfiles. You can specify a TIMEOFFSET command to say how much analog should offset the time of the computer on which it is running, to get your local time.
UNCOMPRESS *.gz,*.Z /usr/bin/gzcat
whereas on Windows NT, you might use
UNCOMPRESS *.gz "c:\Program Files\gzip\gzip -cd"
and on VMS, it could be
UNCOMPRESS *.LOG-GZ;* "gunzip -c"
This would be a suitable command to include in the default configuration file.
If analog determines, when it starts to uncompress a logfile, that the file isn't wanted for the analysis, one of two undesirable things can happen. Either the program might pause until the logfile is fully uncompressed, or a "broken pipe" error might be reported. This is system-dependent, and out of analog's control.
The common logfile format is written by most servers. Its lines look like
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
Specifying LOGFORMAT COMMON is the same as specifying the three commands
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%r" %c %b)
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
and the browser (or agent) log looks like
[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The respective LOGFORMAT commands are
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %B)
In both of these logfiles the date can be omitted. However, if the date is omitted in the browser log, analog will not be able to detect the log format automatically. (It doesn't contain enough clues, so there is too much danger of confusing other log formats with it; just use "LOGFORMAT %B".)
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243 "http://www.statslab.cam.ac.uk/" "Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
except all on one line. If you are using the Apache server, you can generate this with the mod_log_config module, using the command
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\""
The corresponding LOGFORMAT commands are
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%r" %c %b "%f" "%B")
It is usually better to use the combined log than separate logs, because it stores more information in less space.
The extended log is described at http://www.w3.org/TR/WD-logfile.html. Its header line looks like
#Fields: date time cs-uri
In the rest of the logfile, the fields can be separated by spaces or tabs. There is also Microsoft's attempt at the extended format -- unfortunately they didn't read the spec., so they didn't enclose the browser and referrer in quotes, and they replaced spaces in the browser name with +'s.
The WebSTAR file has a header line like
!!LOG_FORMAT DATE TIME RESULT URL BYTES_SENT HOSTNAME
In the rest of the logfile, the fields are separated by tabs. Some other Mac servers also use the WebSTAR format, or something looking like it. Analog will understand these too. Finally, the Netscape header line looks like
format=%Ses->client.ip% [%SYSDATE%] "%Req->reqpb.clf-request%" %Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%
Sometimes these three logfile formats can contain header lines which refer to the same item in two different ways. Analog doesn't know which one you want to count, so such header lines will generate a "corrupt format line" warning. You can then use a LOGFORMAT command to specify the format more precisely.
192.64.25.41, -, 21/02/97, 00:03:46, W3SVC1, SPIDER, 192.16.225.10, 30, 303, 1455, 200, 0, GET, /siege.htm, -,
(except all on one line) or
LOGFORMAT (%S, %u, %d/%m/%y, %h:%n:%j, W3SVC%j, %j, %v, %j, %j, %b, %c, %j, %j, %r, %j,)
However, the format is extremely badly designed, in that the date follows local conventions: in other words, in North America the above example would have the date 02/21/97 instead. Analog will diagnose which form the logfile is in if possible: but if both the date and the month are at most 12, there is no way to tell which format it is. In this case, you need to use the LOGFORMAT command MICROSOFT-NA for North American date format, or MICROSOFT-INT for international date format. It may even be that the date is in neither of these formats, in which case you need to use a LOGFORMAT command of your own.
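The reason the date order can only sometimes be diagnosed is easy to see in code. The following Python sketch is one plausible way to make the decision, under the assumption that a field greater than 12 cannot be a month; it is not taken from analog's source, and the function name guess_microsoft_format is hypothetical.

```python
def guess_microsoft_format(dates):
    """Guess the date order from strings like '21/02/97'.

    Returns 'MICROSOFT-INT', 'MICROSOFT-NA', or None when the dates
    seen so far do not settle the question.
    """
    day_first = month_first = False
    for date in dates:
        first, second, _year = (int(part) for part in date.split("/"))
        if first > 12:       # first field cannot be a month
            day_first = True
        if second > 12:      # second field cannot be a month
            month_first = True
    if day_first and not month_first:
        return "MICROSOFT-INT"
    if month_first and not day_first:
        return "MICROSOFT-NA"
    return None              # ambiguous (or contradictory) evidence

print(guess_microsoft_format(["21/02/97"]))
print(guess_microsoft_format(["03/04/97"]))
```

When the sketch returns None, the situation matches the one described above, where you have to name the format yourself.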
There are also various third-party extensions to the Microsoft format to include, for example, the browser and referrer. Analog can't automatically diagnose these: you need to write a LOGFORMAT string for them.
5:54 pm 14/11/96 134.87.19.110 HTTP get file Research.html Web:Research:Research.html Referer: http://guide-p.infoseek.com/Titles
The fields are separated by tabs. It is equivalent to four LOGFORMAT commands:
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R\nReferer: %f)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%R)
LOGFORMAT (%j)
Again, the Netpresenz format uses local conventions for the date and time. Analog will diagnose it where it can: otherwise, you will have to use
LOGFORMAT NETPRESENZ-NA # dates like 9:14 AM 3/23/98 (upper case AM)
or
LOGFORMAT NETPRESENZ-INT # dates like 9:14 am 23/3/98 (lower case am)
Again, it can be that the date and time is in neither of these forms, in which case you will have to enter your own LOGFORMAT string.
CASE INSENSITIVE
CASE SENSITIVE
Next it applies built-in aliases to each item. For example, it knows that %7E in a filename or referrer is equivalent to ~ and translates it accordingly. It also strips off the directory suffix from any filenames which have it. This suffix is normally index.html, but you can specify another one instead with a command such as
DIRSUFFIX default.htm
(You can only have one DIRSUFFIX.) There are other built-in aliases for other items: for example, hostnames are converted to lower case at this point.
FILEALIAS /football.html /soccer.html
HOSTALIAS lion lion.statslab.cam.ac.uk
There is also the special command FILEALIAS none, which cancels any other file aliases which might have been specified.
The alias commands for the other items are called BROWALIAS, REFALIAS, USERALIAS and VHOSTALIAS. Only one alias is ever applied to any item. So after
FILEALIAS /football.html /soccer.html
FILEALIAS /soccer.html /brazil.html
the file /soccer.html would get translated to /brazil.html, but /football.html would only get translated to /soccer.html and would not see the second alias.
You can also use wildcards (? and *) in alias commands. The left hand side can contain at most one *, unless the right hand side contains no *'s. If the right hand side contains a * too, then the part of the name represented by the * on the left hand side will be substituted at the position of the * on the right hand side. So, for example,
FILEALIAS /football/* /soccer/
would translate /football/rules.html to /soccer/, but
FILEALIAS /football/* /soccer/*
would translate /football/rules.html to /soccer/rules.html.
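The alias rules above (only one alias per item, and * on the right standing for whatever * matched on the left) can be modelled with a short Python sketch. It is an illustration under assumptions rather than analog's algorithm: the function name apply_alias is invented, and I have assumed that when several aliases could match, the first one listed wins, which the Readme does not state.

```python
import re

def apply_alias(name, aliases):
    """Apply at most one (left, right) alias to name."""
    for left, right in aliases:
        # Turn the analog-style wildcards into a regular expression.
        pattern = re.escape(left).replace(r"\*", "(.*)").replace(r"\?", ".")
        match = re.fullmatch(pattern, name)
        if match:
            if "*" in right and match.groups():
                # Substitute the text matched by * on the left.
                return right.replace("*", match.group(1), 1)
            return right
    return name  # no alias matched

chain = [("/football.html", "/soccer.html"), ("/soccer.html", "/brazil.html")]
print(apply_alias("/football.html", chain))
print(apply_alias("/football/rules.html", [("/football/*", "/soccer/*")]))
```

Because the function returns as soon as one alias has matched, /football.html stops at /soccer.html and never reaches the second alias.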
TYPEOUTPUTALIAS .txt ".txt (Plain text files)"
would provide an explanation of that line in the file type report.
There can be some confusion between some ALIAS and OUTPUTALIAS commands. For example, what is the difference between HOSTALIAS and HOSTOUTPUTALIAS? In fact, there are several differences, resulting from the different times at which the aliases are processed. The HOSTALIAS applies to the host items, but the HOSTOUTPUTALIAS only applies to the lines in the host report. This means that the HOSTALIAS also affects the other reports which use the hosts, such as the domain report, whereas the HOSTOUTPUTALIAS only affects the host report. Also the HOSTOUTPUTALIAS applies separately to each line of the host report. This means that if two separate hosts translate to the same thing in a HOSTALIAS command, they will become one host ever after. But if one were to use the same HOSTOUTPUTALIAS commands, there would be two hosts, which would just happen to have the same name in one report.
In summary, HOSTALIAS would normally be used if a single host had two different names, so might otherwise appear to be two hosts, whereas HOSTOUTPUTALIAS would normally be used to annotate or clarify the host report.
The full list of output aliases is REQOUTPUTALIAS, REDIROUTPUTALIAS, FAILOUTPUTALIAS, TYPEOUTPUTALIAS, DIROUTPUTALIAS, HOSTOUTPUTALIAS, DOMOUTPUTALIAS, REFOUTPUTALIAS, REFSITEOUTPUTALIAS, REDIRREFOUTPUTALIAS, FAILREFOUTPUTALIAS, BROWOUTPUTALIAS, FULLBROWOUTPUTALIAS, VHOSTOUTPUTALIAS, USEROUTPUTALIAS and FAILUSEROUTPUTALIAS.
There is one known bug with OUTPUTALIAS. The report is sorted before the OUTPUTALIAS is applied. This means that if the SORTBY for the report is set to ALPHABETICAL, then the report will not be sorted correctly.
The rule for determining whether an item is included or excluded is as follows. All the INCLUDE and EXCLUDE commands for that item are considered one by one in order, and the item is included or excluded according to the last command it matched. Items which don't match any of the INCLUDE or EXCLUDE commands are included if the first command was an exclusion, and excluded if the first command was an inclusion. For example, the configuration
FILEINCLUDE /~sret1/*
FILEEXCLUDE /~sret1/backgammon/*,/~sret1/analog/*
FILEINCLUDE /~sret1/backgammon/*.gif
would instruct the program to examine only my files, excluding my backgammon and analog files, but including gifs in my backgammon directory. On the other hand,
FILEEXCLUDE /~sret1/*/img/*
would analyse all files, except for images in my various directories. Note that inclusions and exclusions can contain any number of wildcards.
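The inclusion rule is compact enough to write down as a Python sketch. The function name is_included and the (kind, patterns) tuples are my own representation of the commands, and Python's fnmatch is used as a stand-in for analog's wildcard matching (it also gives [ and ] a special meaning, which analog may not).

```python
from fnmatch import fnmatchcase

def is_included(name, commands):
    """Decide inclusion from an ordered list of (kind, patterns) commands.

    kind is 'INCLUDE' or 'EXCLUDE'; patterns is a comma-separated string.
    """
    if not commands:
        return True
    # Items matching nothing are included only if the first command
    # was an exclusion.
    verdict = commands[0][0] == "EXCLUDE"
    for kind, patterns in commands:
        if any(fnmatchcase(name, p) for p in patterns.split(",")):
            verdict = kind == "INCLUDE"  # the last matching command wins
    return verdict

commands = [
    ("INCLUDE", "/~sret1/*"),
    ("EXCLUDE", "/~sret1/backgammon/*,/~sret1/analog/*"),
    ("INCLUDE", "/~sret1/backgammon/*.gif"),
]
print(is_included("/~sret1/backgammon/board.gif", commands))
print(is_included("/~sret1/backgammon/rules.html", commands))
```

Running the three FILEINCLUDE and FILEEXCLUDE commands from the example through this function includes the gif and excludes the other backgammon file.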
The relevant commands for the other types of item are HOSTINCLUDE and HOSTEXCLUDE; BROWINCLUDE and BROWEXCLUDE; REFINCLUDE and REFEXCLUDE; USERINCLUDE and USEREXCLUDE; and VHOSTINCLUDE and VHOSTEXCLUDE. If you get confused with all the inclusions and exclusions, remember that you can always run analog -settings to see what the options you have specified represent.
FROM 990701
TO 000630
Alternatively, each of the components can be preceded by + or - to represent time relative to the time at which the program was invoked. In this case, the date can have more than 2 digits. This allows constructions like
FROM -01-00+01  # from tomorrow last year
TO   -00-0131   # to the end of last month (OK even if last month
                # didn't have 31 days)
FROM -00-00-112
TO   -00-00-01  # statistics for the last 16 weeks
FROM -00-00-00:-06+01  # statistics for the last 6 hours
There are command line abbreviations +F and +T for the FROM and TO commands; for example, +T-00-00-01:1800 looks at statistics until 6pm yesterday. -F and -T turn off the from and to, as do FROM OFF and TO OFF.
REFREPEXCLUDE http://www.yahoo.com/*
would exclude Yahoo! referrers from the referrer report. However, it would not exclude them from the failed referrer report, the referring site report, etc. (you need to use FAILREFEXCLUDE, REFSITEEXCLUDE etc. for that); nor would it prevent other analysis of logfile lines with those referrers, as REFEXCLUDE would. Also REFREPEXCLUDE would include the referrers in the "not listed" line at the bottom of the report.
The full list of these commands is REQINCLUDE and REQEXCLUDE; REDIRINCLUDE and REDIREXCLUDE; FAILINCLUDE and FAILEXCLUDE; TYPEINCLUDE and TYPEEXCLUDE; DIRINCLUDE and DIREXCLUDE; HOSTREPINCLUDE and HOSTREPEXCLUDE; DOMINCLUDE and DOMEXCLUDE; REFREPINCLUDE and REFREPEXCLUDE; REFSITEINCLUDE and REFSITEEXCLUDE; REDIRREFINCLUDE and REDIRREFEXCLUDE; FAILREFINCLUDE and FAILREFEXCLUDE; BROWSUMINCLUDE and BROWSUMEXCLUDE; FULLBROWINCLUDE and FULLBROWEXCLUDE; VHOSTREPINCLUDE and VHOSTREPEXCLUDE; USERREPINCLUDE and USERREPEXCLUDE; and FAILUSERINCLUDE and FAILUSEREXCLUDE. The inclusion or exclusion applies to the unaliased name, if you are doing any output aliases.
    REQINCLUDE pages
to include only pages in the request report.
Analog determines which files should count as pages (and thus which requests count as page requests) using another INCLUDE/EXCLUDE pair, called PAGEINCLUDE and PAGEEXCLUDE. By default, *.html, *.htm and directories (*/) count as pages. But you can change the list with commands like
    PAGEINCLUDE *.ps,*.ps.gz
    PAGEEXCLUDE sret1.html
(I.e., PostScript and gzipped PostScript are pages, but sret1.html isn't.)
Finally, there are commands called ARGSINCLUDE and ARGSEXCLUDE, and REFARGSINCLUDE and REFARGSEXCLUDE. Sometimes a URL contains arguments after a question mark. For example, the URL
    /cgi-bin/script.pl?x=1&y=2
runs the /cgi-bin/script.pl program with arguments x=1 and y=2. (Sometimes the server records the arguments in a separate field in the logfile, but if so you can use the %q field in the LOGFORMAT command, and analog will translate the filename to the above format).
Analog can either read or ignore the arguments. If the command ARGSEXCLUDE /cgi-bin/script.pl were given, analog would ignore the arguments to that file, and so treat the above URL as being the same as /cgi-bin/script.pl. On the other hand, if ARGSINCLUDE /cgi-bin/script.pl were specified, analog would read the arguments, and treat the above URL as a different file from /cgi-bin/script.pl (or from /cgi-bin/script.pl?y=2&x=1), although a grand total for /cgi-bin/script.pl would still be listed in the Request Report.
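The effect of ARGSINCLUDE and ARGSEXCLUDE on how a URL is counted can be pictured with a short Python sketch. This is only an illustration, not analog's own code: the function name effective_name is made up, and Python's fnmatch wildcards are assumed to be close enough to analog's for this purpose.

```python
from fnmatch import fnmatchcase

def effective_name(url, args_exclude=()):
    """Return the name under which a request would be counted (illustrative)."""
    filename, question_mark, _args = url.partition("?")
    # If the file matches an ARGSEXCLUDE-style pattern, the arguments are dropped.
    if question_mark and any(fnmatchcase(filename, pat) for pat in args_exclude):
        return filename
    # Otherwise the arguments are kept, which is the default behaviour.
    return url

url = "/cgi-bin/script.pl?x=1&y=2"
print(effective_name(url, ["/cgi-bin/script.pl"]))  # arguments ignored
print(effective_name(url))                          # arguments read
```

With the arguments kept, ?x=1&y=2 and ?y=2&x=1 remain two different names, as described above.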
REFARGSINCLUDE and REFARGSEXCLUDE are the same for referrers. By default, all arguments are included. The check for whether the arguments should be included happens before the filename is aliased: this means that you can't use pages in this command, because we don't know whether a file is a page until after it's been aliased.
If you want to see the arguments in a report you may also have to set the appropriate ARGSFLOOR command.
There are 27 different reports which analog can produce, if your logfiles contain the necessary information. Each one has a short name, and a code letter or number, as follows:
    x  GENERAL      General Summary
    m  MONTHLY      Monthly Report
    W  WEEKLY       Weekly Report
    D  FULLDAILY    Daily Report
    d  DAILY        Daily Summary
    H  FULLHOURLY   Hourly Report
    h  HOURLY       Hourly Summary
    4  QUARTER      Quarter-Hour Report
    5  FIVE         Five-Minute Report
    S  HOST         Host Report
    o  DOMAIN       Domain Report
    r  REQUEST      Request Report
    i  DIRECTORY    Directory Report
    t  FILETYPE     File Type Report
    z  SIZE         File Size Report
    E  REDIR        Redirection Report
    I  FAILURE      Failure Report
    f  REFERRER     Referrer Report
    s  REFSITE      Referring Site Report
    k  REDIRREF     Redirected Referrer Report
    K  FAILREF      Failed Referrer Report
    B  FULLBROWSER  Browser Report
    b  BROWSER      Browser Summary
    v  VHOST        Virtual Host Report
    u  USER         User Report
    J  FAILUSER     Failed User Report
    c  STATUS       Status Code Report
For details on what the various reports mean, see the section on What the results mean. But in brief, the General Summary gives summary statistics, such as the total number of requests of each type. The next eight reports are known as time reports; they show the pattern of requests over time. The Host Report and the Domain Report show where people visited from. The Request Report, Directory Report, File Type Report and Size Report show what files people got from your server. The Redirection Report shows files which were redirected to some other file, including "click-thru's." The Failure Report shows files which your server couldn't send out for some reason. The various Referrer Reports show where people followed links from to reach your files. (The Failed Referrer Report is good for spotting broken links.) The Browser Report and Browser Summary show which browsers people were using. If you are using virtual hosts, the Virtual Host Report shows how many requests there were to each virtual host. Similarly if you are using user authentication, the User Report and Failed User Report list the activity for each user. Finally, the Status Code Report shows how many requests returned each HTTP status code.
    FIVE OFF
    REFSITE ON
or by using command line arguments like -5 and +s. You can also turn all reports except the General Summary on or off with the commands ALL ON and ALL OFF, or with the command line arguments +A and -A.
You can turn the "Go To" lines in the report off with the command
    GOTOS OFF
or with the -X command line argument; again, GOTOS ON and +X turn them on again.
The figures in parentheses in the General Summary are for the last seven days: either the seven days before the TO time, or if no TO time is given, the seven days before the time of the program start. The figures for the last seven days are normally included if some, but not all, of the requests fall in those seven days; but you can turn them off by means of the command
    LASTSEVEN OFF
Of course LASTSEVEN ON turns them on again.
You can change the order of the reports by means of the REPORTORDER command. You should list the code letters for all the reports in the order you want them, like this:
REPORTORDER xcmdDhH45WriSoEItzsfKkuJvbB
You can change which file the output goes to with a command like
    OUTFILE stats.htm
or with a command line argument like +Ostats.htm. If you use the filename - or stdout, the output will go to standard output, which is normally the screen, but Unix users might like to redirect it to another file or even into a pipe. You can also use an absolute path name, like
    OUTFILE /usr/bin/httpd/htdocs/stats.html                        # Unix
    OUTFILE "Hard Disk:Server Apps:WebSTAR:Analog:Report.html"      # Mac
    OUTPUT ASCII
You can also select ASCII style with the command line argument +a, and HTML with the command line argument -a. You can also specify OUTPUT NONE for no output, if you are producing a cache file.
Next, you can change the language of the output. There are two ways to do this. The usual way is to use the LANGUAGE command. For example, the command
    LANGUAGE FRENCH
will give you the output in French. The available languages at the moment are CHINESE, CZECH, DANISH, DUTCH, ENGLISH, US-ENGLISH, FINNISH, FRENCH, GERMAN, GREEK, HUNGARIAN, ITALIAN, NORWEGIAN (Bokmål), NYNORSK, POLISH, PORTUGUESE, BR-PORTUGUESE, ROMANIAN, RUSSIAN, SLOVAK, SLOVENE, SPANISH, SWEDISH and TURKISH.
The other way is to use the LANGFILE command. This is useful if you want to download a new language from the analog home page, or if you want to translate one yourself, or even if you want to change some words or phrases or the way the dates and times are formatted in the output. The LANGFILE command tells analog in which file to find the various words and phrases for a new language. For example, the command
    LANGFILE lang/guarani.lng
would read from that file. (Note that you have to include the directory name if the file isn't in the directory or folder which you're running analog from. In particular, it's not assumed to be in the same directory as the other language files.)
Some languages also have domains files available. You can tell analog to use a different domains file instead of the English one using the DOMAINSFILE command.
If you want to translate another language, I would be delighted! You'd be wise to contact me first to make sure that no-one else is already translating the same language. The English language file contains some brief instructions for translating new languages.
    IMAGEDIR img/    # within the same directory as the output
    IMAGEDIR /img/   # off the root directory of your server
There are three commands which affect the top line of the output. First, the LOGO command allows you to replace the analog logo with another image (for example, your organisation's logo). You can say
    LOGO picture.gif            # for this file
    LOGO /images/picture2.gif   # a different file
    LOGO none                   # for no logo
The logo is assumed to be inside the IMAGEDIR unless it starts with a slash, or contains ://.
Then there are commands HOSTNAME and HOSTURL which affect the name and link at the end of the title line. For example, I might specify
    HOSTNAME "Stephen Turner"
    HOSTURL http://www.statslab.cam.ac.uk/~sret1/
to generate the title "Web Server Statistics for Stephen Turner". Again, you can use none as the HOSTURL to specify no link. Analog will normally translate characters in the hostname to HTML if necessary. So to include literal HTML, such as accented characters, in the output you need to precede them by a backslash, like this:
    HOSTNAME "M\&uuml;ller & S\&ouml;hne"
There are commands called HEADERFILE and FOOTERFILE. These let you specify files to be inserted near the top and bottom of your output. You can specify
    HEADERFILE none
to cancel a previously-specified header file.
There are three related commands called SEPCHAR, REPSEPCHAR and DECPOINT. These specify single characters to be used as the thousands separator in numbers, the thousands separator within the columns in the reports, and the decimal point. For example, a French user might choose
    SEPCHAR " "
    REPSEPCHAR none
    DECPOINT ,
to make "three thousand and a quarter" look like "3 000,25" in text and "3000,25" in the reports.
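To make the roles of the separator characters concrete, here is a small Python sketch that reproduces the French example above. It is not how analog itself formats numbers; the function format_number and its parameters are invented for the illustration, with an empty string standing in for "none".

```python
def format_number(value, sepchar, decpoint):
    """Format value to two decimal places with the given separator characters."""
    # Start from Python's own grouping ("3,000.25"), then swap in the chosen characters.
    whole, fraction = f"{value:,.2f}".split(".")
    return whole.replace(",", sepchar) + decpoint + fraction

print(format_number(3000.25, " ", ","))  # like SEPCHAR " " with DECPOINT ,
print(format_number(3000.25, "", ","))   # like REPSEPCHAR none with DECPOINT ,
```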
There is a command called RAWBYTES. Specify RAWBYTES ON if you want the exact number of bytes to be listed in reports, or RAWBYTES OFF if you want the number of kilobytes or Megabytes as appropriate to be listed instead.
Finally there is a command called PAGEWIDTH which specifies the width of the page. The output is not guaranteed to fit in this width, but analog will take notice of it when choosing the width of the time graphs, and when sorting the host report alphabetically; and if the output format is ASCII, when drawing horizontal rules and printing some bits of text. I recommend about PAGEWIDTH 65 for HTML output, and PAGEWIDTH 75 for ASCII output.
Each time report can contain columns listing the requests, requests for pages, and bytes transferred at that time, using the following code letters.
    HOURCOLS Pb
tells analog to include the number of page requests and percentage of the bytes, in that order, as the columns for the Hourly Summary. The other COLS commands are MONTHCOLS, WEEKCOLS, DAYCOLS (Daily Summary), FULLDAYCOLS (Daily Report), FULLHOURCOLS (Hourly Report), QUARTERCOLS and FIVECOLS. There is also a TIMECOLS command, which specifies that all the time reports are to have the specified columns.
    FULLDAYGRAPH P
tells analog to plot the bar charts in the Daily Report by the number of page requests. This also controls how analog decides which is the busiest time period in the bottom line of the report. Using a lower case letter tells analog to plot the bar charts with ASCII characters instead of the normal red bars. (This produces shorter output, and it is how they appear anyway in ASCII output style, or when viewed with a non-graphical browser.) So, for example,
    FULLDAYGRAPH b
would plot the Daily Report by bytes, without using the graphics. The other GRAPH commands are MONTHGRAPH, WEEKGRAPH, DAYGRAPH, HOURGRAPH, FULLHOURGRAPH, QUARTERGRAPH and FIVEGRAPH. There's also an ALLGRAPH command to set all of them simultaneously.
    MONTHBACK ON    # Monthly Report backwards
    WEEKBACK OFF    # Weekly Report forwards
The other BACK commands are FULLDAYBACK, FULLHOURBACK, QUARTERBACK and FIVEBACK. It tends to be confusing to mix directions (and analog will warn you if you attempt it) so usually you want to use the ALLBACK command which will set all of them at once.
    QUARTERROWS 96  # only the last day's worth
    MONTHROWS 0     # 0 means no restriction: show all time
The other ROWS commands are WEEKROWS, FULLDAYROWS, FULLHOURROWS and FIVEROWS. Even if a ROWS command is given, the line at the bottom of the report will still show the busiest time period ever, not just the busiest one in that many rows.
    MARKCHAR =
tells analog to use the equals sign.
There is a parameter called MINGRAPHWIDTH which sets the minimum nominal size of the graphs. For example, if you set
    MINGRAPHWIDTH 10
then the graph will be allowed to be up to 10 characters wide, even if that would exceed the PAGEWIDTH.
There is one more command which affects the time reports. You can specify which day should be counted as the first day of the week. This affects the layout of the Daily Report, Daily Summary and Weekly Report. For example, our local student newspaper publishes a new edition on the web every Friday, so they like to specify WEEKBEGINSON FRIDAY for their reports.
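The idea behind WEEKBEGINSON can be sketched in a few lines of Python: each date is assigned to the week starting on the most recent occurrence of the chosen day. This is an illustration only, not analog's code, and the helper week_start is a made-up name.

```python
from datetime import date, timedelta

# Python numbers weekdays from Monday = 0 to Sunday = 6.
DAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY", "SUNDAY"]

def week_start(day, week_begins_on):
    """Return the first day of the week containing `day`."""
    first = DAYS.index(week_begins_on.upper())
    days_into_week = (day.weekday() - first) % 7
    return day - timedelta(days=days_into_week)

# Wednesday 28 June 2000 falls in the week that began on Friday 23 June.
print(week_start(date(2000, 6, 28), "FRIDAY"))
```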
In the next section, we'll look at commands relating to the non-time reports.
First, these reports have COLS commands, just like the time reports. (See the section on Time reports for how to use these commands.) In the non-time reports, one additional column is possible, namely D for date of last access. So, for example,
    REQCOLS RD
lists the number of requests for each file in the Request Report, and the time when that file was last requested. The full list of COLS commands for non-time reports is HOSTCOLS, DOMCOLS, REQCOLS, DIRCOLS, TYPECOLS, SIZECOLS, REDIRCOLS, FAILCOLS, REFCOLS, REFSITECOLS, REDIRREFCOLS, FAILREFCOLS, FULLBROWCOLS (Browser Report), BROWCOLS (Browser Summary), VHOSTCOLS, USERCOLS, FAILUSERCOLS and STATUSCOLS. Not every column is allowed in every report, but if you specify an illegal one, analog will warn you about it.
    HOSTSORTBY ALPHABETICAL
will sort the Host Report alphabetically. The other SORTBY commands are DOMSORTBY, REQSORTBY, DIRSORTBY, TYPESORTBY, REDIRSORTBY, FAILSORTBY, REFSORTBY, REFSITESORTBY, REDIRREFSORTBY, FAILREFSORTBY, FULLBROWSORTBY, BROWSORTBY, VHOSTSORTBY, USERSORTBY, FAILUSERSORTBY and STATUSSORTBY. Again, not every sort method is possible in every report, but you'll be warned if you choose an illegal one.
There is one known bug concerned with SORTBY ALPHABETICAL. The report is sorted before any OUTPUTALIAS is applied. This means that if an OUTPUTALIAS has been specified for the report, then the report will not be sorted correctly.
    DOMFLOOR 1000r       # all domains with at least 1000 requests
    DOMFLOOR 1000p       # at least 1000 requests for pages
    DOMFLOOR 1000000b    # at least 1,000,000 bytes transferred
    DOMFLOOR 1Mb         # at least 1 megabyte
    DOMFLOOR 0.5%r       # 0.5% of the requests (ditto %p and %b)
    DOMFLOOR 0.5:r       # 0.5% of the maximum number of requests
                         # for any domain (ditto :p and :b)
    DOMFLOOR 970701d     # last access since 1st July 1997
    DOMFLOOR -00-01-00d  # last access in last month (see
                         # documentation on FROM and TO commands)
    DOMFLOOR -100r       # domains with top 100 number of requests
                         # (ditto -100p, -100b, -100d)
The other FLOOR commands are HOSTFLOOR, REQFLOOR, DIRFLOOR, TYPEFLOOR, REDIRFLOOR, FAILFLOOR, REFFLOOR, REFSITEFLOOR, REDIRREFFLOOR, FAILREFFLOOR, FULLBROWFLOOR, BROWFLOOR, VHOSTFLOOR, USERFLOOR, FAILUSERFLOOR and STATUSFLOOR. Once again, not every floor method is legal for every report, but you'll be warned if you try to choose an illegal one.
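The different floor forms can be told apart mechanically, as the following Python sketch suggests. It is a simplified reading of the examples above and not analog's parser: the function parse_floor and its result labels are invented, and the date forms (ending in d) and size abbreviations such as 1Mb are deliberately left out.

```python
import re

# Optional minus, a number, an optional % or : qualifier, then r, p or b.
FLOOR_PATTERN = re.compile(r"(-?)(\d+(?:\.\d+)?)([%:]?)([rpb])")

def parse_floor(spec):
    """Classify a floor such as '1000r', '0.5%r', '0.5:r' or '-100r'."""
    match = FLOOR_PATTERN.fullmatch(spec)
    if match is None:
        raise ValueError(f"floor form not covered by this sketch: {spec!r}")
    minus, number, qualifier, measure = match.groups()
    if minus:
        if qualifier or "." in number:
            raise ValueError(f"floor form not covered by this sketch: {spec!r}")
        return ("top_n", int(number), measure)             # e.g. -100r
    if qualifier == "%":
        return ("percent_of_total", float(number), measure)  # e.g. 0.5%r
    if qualifier == ":":
        return ("percent_of_max", float(number), measure)    # e.g. 0.5:r
    return ("at_least", float(number), measure)              # e.g. 1000r

for spec in ["1000r", "0.5%r", "0.5:r", "-100r"]:
    print(spec, parse_floor(spec))
```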
There's one other command which affects the links in the Request Report. The command BASEURL prepends an additional string to the URLs in the target of the link. For example, after the command
    BASEURL http://www.statslab.cam.ac.uk
/~sret1/ will be linked to http://www.statslab.cam.ac.uk/~sret1/, not just to /~sret1/. This is very useful if you want to display the statistics on a different server from the server they refer to.
In the next section, we'll look at commands for generating hierarchical reports, which are closely related to the commands in this section.
First, you need to be able to control what gets listed in the reports. For this you need to use the SUB family of commands. So, for example, the command SUBDIR /~sret1/* would ensure that the Directory Report would not only contain an entry for the sum of my files, but also one for each of my subdirectories, something like this:
    29,111: /~sret1/
    10,234:   /~sret1/analog/
     5,179:   /~sret1/backgammon/
    11,908: /~steve/
If you specify a SUB command, all the intermediate levels are included automatically. So, for example, after
    SUBDOMAIN statslab.cam.ac.uk
cam.ac.uk and ac.uk will be included in the Domain Report too, and after *.*.ac.uk, *.ac.uk will be included.
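Which intermediate levels get pulled in can be worked out by stripping leading components one at a time. The Python sketch below does this for the first example; it only illustrates the rule as stated here, and the function intermediate_levels is a hypothetical name, not part of analog.

```python
def intermediate_levels(subdomain):
    """List the levels between a subdomain and its top-level domain."""
    parts = subdomain.split(".")
    # Drop one leading component at a time, stopping before the bare top level.
    return [".".join(parts[i:]) for i in range(1, len(parts) - 1)]

print(intermediate_levels("statslab.cam.ac.uk"))
```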
Here are examples of the other three SUB commands:
    SUBTYPE *.gz                      # in the File Type Report
    SUBBROW Mozilla/*                 # in the Browser Summary
    REFDIR http://search.yahoo.com/*  # Referring Site Report
The SUBDOMAIN command (but none of the others) can include a second argument describing the subdomain. For example
    SUBDOMAIN cam.ac.uk 'University of Cambridge'
Then that subdomain will be listed with its translation in the Domain Report. You can also have numerical subdomains: e.g.,
    SUBDOMAIN 131.111 'University of Cambridge'
If you sort the subdomains alphabetically, the numerical ones will also be sorted alphabetically, not numerically. I don't think this will cause any problems.
One other use for the SUBDIR command is if you have used the second argument to the LOGFILE command. Suppose you have translated files like /index.html into http://www.mycompany.com/index.html. Then the command
    SUBDIR http://www.mycompany.com/*
would be appropriate to make the directory report look right.
A sub-item is listed in a hierarchical report only if it is above the sub-FLOOR, and it is included with a SUB command, and its immediate parent is listed. For example, specifying
    SUBDIR /*/*/
    SUBDIRFLOOR -3r
    SUBDIRSORTBY REQUESTS
would list the three subdirectories with most requests under each directory. SUBDIRFLOOR 1:r would have listed any subdirectory with at least 1% of the maximum number of requests of any top level directory.
The report INCLUDE and EXCLUDE commands for a hierarchical report only apply to the top level of the report: you can use the SUB commands for the lower levels.
The three file reports (Request Report, Redirection Report and Failure Report) and the three referrer reports (Referrer Report, Redirected Referrer Report and Failed Referrer Report) are not fully hierarchical, but they do list search arguments together under the file to which they refer (provided that the arguments have been read in: see the ARGSINCLUDE command). So they have similar sub-FLOOR and sub-SORTBY commands, namely REQARGSFLOOR, REDIRARGSFLOOR, FAILARGSFLOOR, REFARGSFLOOR, REDIRREFARGSFLOOR and FAILREFARGSFLOOR; and REQARGSSORTBY, REDIRARGSSORTBY, FAILARGSSORTBY, REFARGSSORTBY, REDIRREFARGSSORTBY and FAILREFARGSSORTBY.
That concludes the description of all the output configuration commands. Now we move on to some other individual topics, starting with the domains file.
    DOMAINSFILE domains.tab
This is useful if you want to use a domains file in a different language, for example. If you haven't got a domains file, you can download one from http://www.statslab.cam.ac.uk/~sret1/analog/domains.tab. It should contain each domain code followed by its location on a new line, thus:
    ad Andorra
    ae United Arab Emirates
    [...]
It does not need to be in alphabetical order, though humans may prefer it that way.
If you want HTML special characters in the domains file, you have to precede them with a backslash, like this:
    am Arm\&eacute;nie
Only domains which occur in the domains file will get their own line in the Domain Report: the rest are probably spurious, and will be accumulated together as "unknown domains". If you have debugging turned on, you can see which domains were unknown.
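As a rough picture of how a domains file drives the Domain Report, the Python sketch below reads lines in the format shown above and looks up the last component of a hostname. It is a simplification rather than analog's implementation: load_domains and domain_label are made-up names, backslash escapes are not handled, and numerical addresses are not treated specially.

```python
def load_domains(lines):
    """Build a {code: description} table from domains-file lines."""
    domains = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # code, then the rest of the line
        if len(parts) == 2:
            domains[parts[0].lower()] = parts[1]
    return domains

def domain_label(hostname, domains):
    """Return the description for a hostname's domain, or the catch-all."""
    code = hostname.rsplit(".", 1)[-1].lower()
    return domains.get(code, "unknown domains")

table = load_domains(["ad Andorra", "ae United Arab Emirates"])
print(domain_label("www.example.ae", table))
print(domain_label("host.nosuchcode", table))
```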
    OUTPUT COMPUTER
This style is designed to be easy to read into spreadsheets, or post-process with graphics creation tools, for example.
Each line in the output is separated into fields by means of a special string. You can specify this string by means of the COMPSEP command; for example
    COMPSEP :::
if for some reason you wanted three colons between each column. Make sure not to use anything that might occur in the output: for example, a single or double space would not be suitable.
Each line in the preformatted output begins with a letter indicating which report the line is part of. (The code letters for the reports are listed in the section on Configuring the Output.) After that, there follows a field indicating the remaining columns in the report (using the letters RrPpBbD as usual). Then there are the numerical data and then the name of the item. Times actually take up several fields: year, month, date, hour & minute, or as many of those as are necessary to identify the time.
The general summary is a bit different. After an initial x, there is a two-character code saying what the line contains. The possible codes are
If you do anything interesting with this output style, I should be delighted to hear about it. Anyone want to write a program to turn it into those pretty charts that executives seem to love?
For most people, the cache file will not be needed: compressing the logfile using a standard compression utility such as gzip will be sufficient. Compressing a logfile is very efficient owing to the large number of repeated strings: I find about 12 times compression in practice. That in itself may solve your filespace problems, without needing to throw away any information.
The cache file is not the best format for post-processing the data or feeding it into a spreadsheet. For that you should use the computer readable output style.
If you are going to use the cache file feature, it is very important that you understand what is and what is not recorded. It is not possible to reconstruct everything of interest in the logfile from the cache file. The cache file does contain information about the total number of requests for each host and each file, but not about, for example, which files were read by which hosts. (To do so would take up as much disk space as the compressed logfile.) So you cannot later look at only one file and see which hosts read that file. Similarly, you cannot later restrict the files or hosts by date, using FROM and TO commands.
In summary, you should do all the inclusions and exclusions you want when you create the cache file. If you want different sets of inclusions and exclusions, you should create several cache files from the same logfile. You cannot later apply extra inclusions and exclusions accurately.
One other minor point: the pattern of failed requests and redirected requests over time is not recorded in the cache file. So although the total number will still be correct, the number in the last 7 days can be under-reported subsequently.
    CACHEOUTFILE none
to turn it off again. You will still get the regular output as well as the cache output, unless you request OUTPUT NONE. Be careful not to set the CACHEOUTFILE the same as a previous CACHEOUTFILE, or you will overwrite the previous one without warning.
You can read in a previously-made cache file with the CACHEFILE command, or with the +U command line option. As with the LOGFILE command, you can use commas and wild cards to read in several cache files, and read compressed cache files using the UNCOMPRESS mechanism. Note that if you don't want to read a logfile as well as the cache file, you will have to explicitly set the LOGFILE to none.
When analog reads in a cache file, it will respect inclusions and exclusions as far as it can, but it does not apply any more aliases to the items. (This is to avoid double-aliasing.) So you must do any aliases you want at the time you create the cache file. Similarly, it does not obey the LOGTIMEOFFSET variable, to avoid double-offsetting, so any offset you want must be applied at cache-creation time too.
Sometimes you don't want to record all the types of item in the cache file. You might want to forget about which hosts had accessed your web site, for example, and only remember how many times each file was requested. You can choose not to include one type of item in the cache file by setting its LOWMEM to 3; for example, specify
    HOSTLOWMEM 3
to exclude hosts from the cache file. Because this is a serious step, analog will produce a warning if you do this. You can even set all six LOWMEMs to 3 if you just want to remember the pattern of requests over time, not even which files were requested.
It is legal to have the CACHEOUTFILE the same as the CACHEFILE to overwrite the old cache file with an updated one, but it is not recommended. It is best to make a separate cache file for each logfile. Failing that, it is better to write the new cache to a different file, and only delete the old cache when you have verified that the new cache was created correctly.
I prefer to make a separate cache file from each logfile, in case something goes wrong with one of them, rather than a single cache file combining several logfiles, or a single cache file combining an old cache file and a logfile.
Unfortunately DNS lookups are typically very slow, because your computer has to ask across the network to find out the names of the hosts. For this reason, analog saves the addresses it has looked up in a file, so that you don't have to look them up again next time. (Even so, you may find the DNS lookups too slow to be usable.) The file is specified by a command like
    DNSFILE dnsfile.txt
You will still need to use one of the commands in the next paragraph in order to actually use the file.
There are four possible levels of DNS activity. If you specify DNS NONE, no numerical addresses will be resolved. If you specify DNS READ, then analog will read the DNS file for old lookups, but no new lookups will take place. This mode is suitable if you are running analog while not connected to the internet. The third level is DNS WRITE. This reads the old file, looks up new addresses, and adds them to the file. The fourth level is DNS LOOKUP. This reads the old file and looks up new addresses, but doesn't add the new addresses to the file, so that they will not be remembered for next time. The reason for this is that if two copies of analog were running at once, both with DNS WRITE, then it is possible that the DNS file could become corrupted (although the chance is quite small).
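The four levels differ only in whether the DNS file is read, whether new lookups are made, and whether the results are written back. The Python table below simply restates the paragraph above in that form; the dictionary and the helper function are illustrative names, not anything analog provides.

```python
# One row per DNS level, transcribed from the description above.
DNS_LEVELS = {
    "NONE":   {"read_file": False, "new_lookups": False, "write_file": False},
    "READ":   {"read_file": True,  "new_lookups": False, "write_file": False},
    "WRITE":  {"read_file": True,  "new_lookups": True,  "write_file": True},
    "LOOKUP": {"read_file": True,  "new_lookups": True,  "write_file": False},
}

def safe_to_run_concurrently(level):
    """Only a level that writes the DNS file risks corrupting it."""
    return not DNS_LEVELS[level]["write_file"]

for level in DNS_LEVELS:
    print(level, safe_to_run_concurrently(level))
```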
The first time you use DNS WRITE, you will get a missing-file warning, but it will exist the next time.
Jason Linhart has written an application for the Mac called DNSTran, which creates DNS files for analog to read. Because it uses Mac-specific code, it's faster than getting analog to create the file, and I recommend it.
Analog never deletes anything from the DNS file: this means that the DNS file will grow, and can become quite large. You should delete the top of it every so often.
There are two parameters which say how long to trust old lookups for. If you set
    DNSGOODHOURS 672
for example, then successful lookups will be checked again after 672 hours (4 weeks). You can also set the DNSBADHOURS similarly, to check failed lookups again after a certain time.
Finally, there is a debugging command, DEBUG +D, to show all the DNS lookups that analog is making.
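The expiry rule amounts to comparing the age of a stored lookup with the relevant limit. Here is a minimal Python sketch of that comparison, assuming the age of each entry is already known in hours; the function name is invented, the bad-lookup limit of 24 hours is only a placeholder for the example, and whether analog treats an entry of exactly the limit as stale is an assumption.

```python
def needs_new_lookup(age_hours, succeeded, good_hours, bad_hours):
    """Decide whether a stored lookup is old enough to be checked again."""
    limit = good_hours if succeeded else bad_hours
    return age_hours >= limit

# 672 hours is 4 weeks: 4 * 7 * 24.
print(4 * 7 * 24)
print(needs_new_lookup(700, True, good_hours=672, bad_hours=24))   # stale
print(needs_new_lookup(100, True, good_hours=672, bad_hours=24))   # still trusted
```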
Recall what happens to an item when it has been read in. First it is aliased. Secondly, it is checked to see whether it is included or excluded. Then finally, if all the items are wanted, one request is added to its score.
Normally the name of the item is saved before the aliasing takes place. This avoids analog having to do the aliasing again next time the same item is encountered. But this can take up more memory than necessary. So there is a family of LOWMEM commands provided, which tell analog to record the name at a later stage, or even not at all. If you use these commands, analog will have to do a bit more work than normal, but it will use less memory. On most sites, the hosts take up most of the memory, so I'll use the HOSTLOWMEM command as an example.
The command
    HOSTLOWMEM 0
represents the normal case, when the hostname is recorded before being aliased. If you specify
    HOSTLOWMEM 1
instead, then the hostname is not recorded until after the aliasing. If you specify
    HOSTLOWMEM 2
then the name is not recorded until after the inclusion and exclusion lookup has been done as well. And finally, if you give the command
    HOSTLOWMEM 3
then the hostname is not saved at all, and the Host Report will not be constructed, even if you've asked for it. (The Domain Report can still be constructed though.) The analogous commands for the other items are FILELOWMEM, BROWLOWMEM, REFLOWMEM, USERLOWMEM and VHOSTLOWMEM.
First, remember the option we mentioned before, to list the current settings of all of analog's variables. To get this, just put -settings on the command line, or PRINTVARS ON in one of your configuration files, along with your other commands. Then analog will produce the list of settings instead of running in the normal way.
    DEBUG ON
you get all the debugging. (And DEBUG OFF turns it off again.) You can also get just certain categories of debugging. The categories are
    DEBUG FS
would give you information about file opening and closing, and what was in each logfile, but none of the other sorts of debugging. Each line of debugging information is prepended with its code letter. You can also specify
    DEBUG +CD
to add C- and D-category debugging, and
    DEBUG -CD
to remove them.
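The set, add and remove forms of the DEBUG argument behave like operations on a set of category letters. The Python sketch below models just that; it is not analog's code, apply_debug is a made-up name, and the full list of categories has to be supplied by the caller because it is not reproduced here.

```python
def apply_debug(current, argument, all_categories=frozenset()):
    """Return the debugging categories in force after one DEBUG command."""
    if argument == "ON":
        return set(all_categories)
    if argument == "OFF":
        return set()
    if argument.startswith("+"):
        return set(current) | set(argument[1:])   # add categories
    if argument.startswith("-"):
        return set(current) - set(argument[1:])   # remove categories
    return set(argument)                          # select exactly these

state = apply_debug(set(), "FS")
state = apply_debug(state, "+CD")
print(sorted(state))
state = apply_debug(state, "-CD")
print(sorted(state))
```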
The WARNINGS command acts similarly. As well as WARNINGS ON and WARNINGS OFF, there are warnings in the following categories.
You can also use command line abbreviations for these commands. The DEBUG command is represented by +V (for ON), -V (for OFF), +VFS (to select options FS), +V+FS (to add those options), and +V-FS (to remove them). Similarly the WARNINGS command can be given by +q, -q, +q<options>, +q+<options> or +q-<options>.
    PROGRESSFREQ 20000   # say
then analog will produce a little message after every 20,000 lines it reads from the logfile. This is useful to determine whether the program has really stopped or (as is more likely) is just being slow for some reason (such as using DNS lookups).
There is just one more section about analog's configuration commands and command line arguments, but it's a rather long one, on the form interface. (This is a way of running analog by selecting options from a web page.) You might prefer to go straight onto the section on What the results mean.
The form interface is suitable for ordinary users to use, but it needs to be set up by a system administrator or other expert. In order to set it up, you need to know what CGI programs are, where they live on your server, and how to set up their permissions properly. It would also be helpful if you can write HTML forms. I shall assume this level of background knowledge for the rest of this section. Also, I shall assume that analog has already been set up and is running properly on its own.
Warning: CGI programs can contain security loopholes which allow an unscrupulous user to harm your system. (If you don't know about this, you shouldn't be running CGI programs at all.) I have tried to make this form interface safe, but I cannot guarantee it, and take no responsibility if anything goes wrong. You use it at your own risk. (See the licence.)
The form interface consists of two parts: a form to choose the options, and a cgi program to interpret them and pass them to the analog program. You don't in fact need the form at all: if you want to create a link to the cgi program, with the arguments passed in the URL in the usual way, then that's fine.
On Unix, to compile the cgi program, you first need to edit the top of anlgform.c to indicate where analog lives on your system. Then type make form, which should compile this source into a program called anlgform.cgi.
On Windows 95 & NT, the cgi program is compiled already, and called anlgform.exe. It assumes that analog is at \analog\analog.exe, so you must move analog there if necessary.
Next put the cgi program wherever your server can find it. Make sure that analog is executable by the server, and that the logfile and domains file are readable. You will probably want to use the full path name for these files; if you don't, it will look locally to anlgform.cgi for them.
The form anlgform.html which is distributed with the program should only be regarded as an example form. Almost every configuration command has a counterpart in the CGI program, and so you can add to the form options to do almost anything you want. (The main exceptions are aliases, which are too complicated, and HEADERFILE, FOOTERFILE and LOGFORMAT, which would allow people to view any file on your system.) I shall give the full list in a minute.
Before you use the form, you must uncomment and edit the action at the top to indicate where anlgform.cgi (or anlgform.exe on Windows) lives on your server. I have also included two other important options at the top, commented out. First, it is often useful to set the logfile to be analysed (or allow the user to choose it), with a field with name="lo". Secondly, some servers need a timezone to be set in a field with name="TZ", or all the times will be wrong. If you are on Unix, you can put any of the standard timezones in this field: the correct one may well be in your own TZ environment variable.
You can specify other configuration files to be included. When analog is called by the CGI program, it first processes the default configuration file as usual. Then it processes any configuration file specified by an argument with name cg. Then it processes all the other arguments which the CGI program specifies. After that, it processes any configuration file specified by an argument with name cm. Finally, it processes the mandatory configuration file as usual.
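The order described above can be sketched as follows. This is only a model of the sequence in Python, not the code of the CGI program itself; the function name and the file names first.cfg and last.cfg are invented for the example.

```python
def processing_order(cg=None, cm=None):
    """List the configuration sources in the order described in the text."""
    order = ["default configuration file"]
    if cg is not None:
        # Configuration file named by the cg argument, read early.
        order.append("configuration file from cg: " + cg)
    order.append("other arguments from the CGI program")
    if cm is not None:
        # Configuration file named by the cm argument, read late.
        order.append("configuration file from cm: " + cm)
    order.append("mandatory configuration file")
    return order


steps = processing_order(cg="first.cfg", cm="last.cfg")
for step in steps:
    print(step)
```

Given this order, a file passed as cg is read before the form's own arguments and a file passed as cm is read after them.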
If the option qv=1 is sent to the CGI program, then analog is not run, but a list of the configuration commands which would have been sent to analog is printed instead. This is useful for checking that the CGI program is working properly. It can also be used for using the form to produce a configuration file.
ON/OFF: q (lc), p (uc); 1 for on, 0 for off
GRAPH: g (lc), h (uc)
ROWS: r (lc), s (uc)
COLS: c (lc), d (uc)
ON/OFF: q (lc), p (uc); 1 for on, 0 for off
FLOOR: f (lc), g (uc); excluding floor method
Floor method: h (lc), i (uc); r, p, b or d
SORTBY: s (lc), t (uc); 0 for requests, 1 for pages, 2 for bytes, 3 for date, 4 for alphabetical, 5 for random
SUB: j; (where applicable)
SUBFLOOR: w (lc), x (uc); as above
Subfloor method: y (lc), z (uc); as above
SUBSORTBY: u (lc), v (uc); as above
COLS: c (lc), d (uc)
Report INCLUDE: l (lc), m (uc)
Report EXCLUDE: n (lc), o (uc)
Browser       b
Referrer      f
File          r
Host          s
User          u
Virtual host  v
Second letter:
LOWMEM   k
INCLUDE  x
EXCLUDE  z
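Assuming that the two lists above combine into a two-letter argument name (one letter for the type of item, followed by one for the command), the codes can be decoded with a sketch like the following. The function name describe is invented, and the returned text is only a description, not necessarily the exact name of an analog command.

```python
# Item letters and command letters, copied from the two lists above.
FIRST = {"b": "Browser", "f": "Referrer", "r": "File",
         "s": "Host", "u": "User", "v": "Virtual host"}
SECOND = {"k": "LOWMEM", "x": "INCLUDE", "z": "EXCLUDE"}


def describe(code):
    """Describe a two-letter code such as 'sx' (assumed: item letter + command letter)."""
    if len(code) != 2 or code[0] not in FIRST or code[1] not in SECOND:
        raise ValueError("not a recognised two-letter code: " + repr(code))
    return FIRST[code[0]] + " " + SECOND[code[1]]


print(describe("sx"))
print(describe("bz"))
```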
Command         Code   Value/Notes
ALLBACK         ab     1 for on, 0 for off
BASEURL         ba
CASE            ca     1 for sensitive, 0 for insensitive
CONFIGFILE      cg/cm  See above
COMPSEP         cp
(none)          cr     Charset of language file
DNSGOODHOURS    da
DNSBADHOURS     db
DECPOINT        de
DOMFILE         df
DIRSUFFIX       di
DNSFILE         dn     Also sets DNS READ; o/wise DNS is NONE
FROM            fr
MINGRAPHWIDTH   gw
HOSTNAME        hn
HOSTURL         hu
IMAGEDIR        ie
LANGUAGE        la     Name of language: LANGFILE overrides
CACHEFILE       lc
LANGFILE        lf     Overrides LANGUAGE
LOGO            lg
LOGFILE         lo
LASTSEVEN       ls
LOGTIMEOFFSET   lt     For all logfiles
REFLINKINCLUDE  lw
LINKINCLUDE     lx
REFLINKEXCLUDE  ly
LINKEXCLUDE     lz
MARKCHAR        ma
OUTPUT          ot     0 for HTML, 1 for ASCII, 2 for COMPUTER
PAGEWIDTH       pw
PAGEINCLUDE     px
PAGEEXCLUDE     pz
(none)          qv     Output configuration file, rather than run analog
RAWBYTES        rb
REPORTORDER     re
SEPCHAR         sa
REPSEPCHAR      sb
TIMEOFFSET      tm
TO              to
WARNINGS        wa     1 for on, 0 for off. If unspecified, get WARNINGS FL.
WEEKBEGINSON    wb     0 for Sunday, 1 for Monday, ..., 6 for Saturday
GOTOS           xp
GENERAL         xq
REFARGSINCLUDE  yw
ARGSINCLUDE     yx
REFARGSEXCLUDE  yy
ARGSEXCLUDE     yz
I should say that these ideas are not new to me. In particular, I can recommend four excellent articles about this subject: Interpreting WWW Statistics by Doug Linder; Making Sense of Web Usage Statistics by Dana Noonan; Getting Real about Usage Statistics by Tim Stehle; and, the most negative of all, Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.
So, what do you know about it? First, I make one request for your front page. You know the date and time of the request and which page I asked for (of course), and the internet address of my computer (my host). I also usually tell you which page referred me to your site, and the make and model of my browser. I do not tell you my user name or my e-mail address.
Next, I look at the page (or rather my browser does) to see if it's got any graphics on it. If so, and if I've got image loading turned on in my browser, I make a separate connection to retrieve each of these graphics. I never log into your site: I just make a sequence of requests, one for each new file I want to download. The referring page for each of these graphics is your front page. Maybe there are 10 graphics on your front page. Then so far I've made 11 requests to your server.
After that, I go and visit some of your other pages, making a new request for each page and graphic that I want. Finally, I follow a link out of your site. You never know about that at all. I just connect to the next site without telling you.
The other sort of cache is on a larger scale. I'm in the UK. Because the link across the Atlantic is sometimes very congested, we've set up a national cache. (Many individual ISPs also do the same thing.) I can set my browser to get your pages from the national cache instead of directly from you. If anyone else in the country has used the cache to look at your pages recently, the cache will have saved them, and will give them out to me without ever telling you about it. So hundreds of people could read your pages, even though you'd only sent them out once. Also, if the page I wanted wasn't already stored in the cache, the cache would ask for it from you on my behalf. This would mean that the request appeared to come from the cache, rather than from me. If several people did this, you would think that only one host was accessing your pages, rather than lots of different ones.
You can also know what people told you their browsers were, and what the referring pages were. You should be aware, though, that many browsers lie deliberately about what sort of browser they are, or even let users configure the browser name. Also, some browsers send incorrect referrers, telling you the last page that the user was on even if they weren't referred by that page.
The host is the computer which has asked you for a file. The file might be a page (i.e., an HTML document) or it might be something else, such as an image. The total requests counts all the files which have been requested, including pages, graphics, etc. (Some people call this the number of hits, but that word is used in different ways by different people, so I avoid it). The requests for pages obviously only counts pages. The referrer for a request is the place that the user (or his computer) heard about your file from. If he followed a link to reach a page, it will be the previous page. In the case of a graphic on a page, the referrer will be the page containing the graphic.
First, successful requests are those with HTTP status codes in the 200's (where the document was returned) or with code 304 (where the document was requested but was not needed because it had not been recently modified and the user could use a cached copy). Sometimes the logfile line doesn't contain a status code. These lines are also assumed by analog to be successes.
Redirected requests are those with other codes in the 300's, indicating that the user was directed to a different file instead. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). The other common cause of redirected requests is their use as "click-thru" advertising banners.
Failed requests are those with codes in the 400's (error in request) or 500's (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.
Finally, requests returning informational status code are those with status codes in the 100's. These are very rare at the moment.
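The four categories just described can be summarised in a short Python sketch. It only models the rules as stated in the text and is not taken from analog's source; the function name and the label "unclassified" (for any code outside the ranges mentioned) are my own.

```python
def classify_status(code):
    """Return the category of a logfile line from its HTTP status code.

    code is an integer, or None if the line contained no status code.
    """
    if code is None:
        return "success"          # lines without a code are assumed to be successes
    if 200 <= code <= 299 or code == 304:
        return "success"          # document returned, or cached copy still valid
    if 300 <= code <= 399:
        return "redirected"       # any other code in the 300's
    if 400 <= code <= 599:
        return "failed"           # error in request (400's) or server error (500's)
    if 100 <= code <= 199:
        return "informational"
    return "unclassified"         # not covered by the description above


for code in (200, 304, 301, 404, 500, 100, None):
    print(code, classify_status(code))
```

Note that 304 has to be tested before the general 300's rule, since it counts as a success rather than a redirection.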
There are a few other types of logfile lines listed in the General Summary. Lines without status code refers to those logfile lines without a status code, and the successful requests in the General Summary only counts the ones with a status code: except if the line contains the name of the file requested, and the filename is being counted (not starred in the LOGFORMAT), then it's listed in the successes. Corrupt logfile lines are those which analog didn't manage to parse. And unwanted logfile entries are ones which we have specifically excluded. Successful requests for pages refers to those lines on which the file requested was given and was defined as a page by the PAGEINCLUDE command.
The "not listed" line at the bottom of each of the non-time reports includes both those items which were explicitly excluded at the output stage with an OUTPUTEXCLUDE command, and those which were not listed because they were below the floor for the report.
The figures in parentheses in the General Summary are for the last seven days: either the seven days before the TO time, or if no TO time is given, the seven days before the time of the program start. (It would be nicer to use the seven days before the last time in the logfile, but we don't know when this is until we've read the whole logfile, and by then it's too late.) The figures for the last seven days are not included if all, or none, of the requests fall in the last seven days.
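The rule for choosing the seven-day period, and for deciding whether to print the figures at all, can be sketched in Python as follows. The function names and the example dates are invented, and the sketch says nothing about how analog treats requests that fall exactly on the boundary of the period.

```python
from datetime import datetime, timedelta


def last_seven_days(program_start, to_time=None):
    """Return (start, end) of the period used for the figures in parentheses."""
    # Use the TO time if one was given, otherwise the time the program started.
    end = to_time if to_time is not None else program_start
    return end - timedelta(days=7), end


def show_last_seven(total_requests, requests_in_window):
    """The figures are omitted if all, or none, of the requests are in the period."""
    return 0 < requests_in_window < total_requests


start = datetime(1999, 3, 15, 12, 0)   # example program start time
window = last_seven_days(start)
print(window[0], "to", window[1])
print(show_last_seven(100, 40))
```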
In the Domain Report, "domain not given" means that the hostname did not contain a dot. "Unknown domain" means that it did contain a dot, but that the domain name was not in the domains file.
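The distinction can be illustrated with a small Python sketch. It makes two simplifying assumptions of its own: the domain is taken to be the part of the hostname after the last dot, and numerical addresses are ignored. The function name and the set of known domains are invented for the example.

```python
def classify_host(hostname, known_domains):
    """Return the Domain Report entry a hostname would fall under."""
    if "." not in hostname:
        return "domain not given"
    # Simplification: treat the text after the last dot as the domain.
    domain = hostname.rsplit(".", 1)[1].lower()
    if domain not in known_domains:
        return "Unknown domain"
    return domain


known = {"uk", "com", "de"}   # stand-in for the contents of the domains file
for host in ("localhost", "www.cam.ac.uk", "host.example.zz"):
    print(host, "->", classify_host(host, known))
```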
First, you should understand the difference between a crash, an error, a warning, and a debugging message. A crash is when analog exits prematurely, without producing the whole output file. The system might give a message, but analog will not give one of its own messages. Analog should never crash. If it does crash, please tell me about it.
An error is something which stops analog finishing its job. Whenever an error is detected, analog gives a message starting something like analog: Fatal error: and will then tell you what type of thing went wrong before quitting.
A warning is a problem which is not fatal to analog: it will keep on with its processing. These vary from the possibly serious, such as files which could not be found, to purely informational. They produce a message starting analog: Warning. You can turn warnings off using the WARNINGS command.
Finally, a debugging message gives information on the state of the program. They just begin with a single code letter followed by a colon. You don't get any debugging messages unless you've asked for them.
Now I shall describe all the possible errors and warnings in detail.
I get a lot of e-mail about analog, so I would appreciate it if you would do the following simple things before mailing me.
I'm sorry to be so fussy, but a lot of the mail I get really needn't have been sent at all. As I say, I really do welcome genuine mail. After all that, you can send your mail to sret1@cam.ac.uk.
There is also a mailing list for receiving news of updates to analog. To join that list, see the next section.
I'm intending to set up a mailing list soon for discussing how to use analog. Keep an eye on the analog home page for an announcement. In the mean time, you can just send mail to me.
Thanks are due to the author of getstats, Kevin Hughes. In the days before analog there were only three serious logfile analysis programs, and only one of them, getstats, had attractive output. I wrote analog when getstats stopped being able to cope with the size of our logfile, but my output still looks similar to his.
Thanks are also due to all those who helped in the early stages of writing this program, and gave me the encouragement to continue with analog and to release it publicly. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth. Above all Gareth McCaughan gave, and continues to give, me lots of programming advice. The program would have run much more slowly without him.
Many people have provided mirror sites for analog, starting with Dave Stanworth (again!). The full list of mirror sites is given elsewhere; thanks to all of them.
Mark Roedel first suggested porting analog to different platforms, and made the original DOS port. Shortly afterwards, Jason Linhart made the Mac port, and has continued to contribute lots of extra code for that platform and for the program in general. The Mac version also includes code contributed by Stephan Somogyi and Nigel Perry, and uses the ZLib library by Jean-loup Gailly & Mark Adler. Later ports were made by Dave Jones (VMS), Magnus Hagander (Win32), Nick Smith (Acorn RiscOS), Scott Tadman (BeOS), and Martin Kraemer & Holger Schranz (BS2000/OSD). Ivan Martinez compiles the OS/2 version. The BS2000/OSD port includes code developed by the Apache Group for use in the Apache HTTP server project. Thanks to all the other people who have contributed bits of code too.
For the translations into other languages, many thanks are due to the following: Patrice Lafont, Lucien Vieira, Jean-Marc Coursimault & Lionel Delaude (French), Mario Ellebrecht, Martin Kraemer, Holger Schranz & Thomas Jacob (German), Javier Solis, Alexander Velasquez & Martin Perez (Spanish), Furio Ercolessi (Italian), Ivan Martinez (Brazilian Portuguese), Jaime Carvalho e Silva (European Portuguese), Yang Meng (Chinese), Adrian Price (Danish), Björn Malmberg (Swedish), Jan-Aage Bruvoll, Espen Bjarnø & Pål Løberg (Norwegian Bokmål), Magni Onsøien (Norwegian Nynorsk), Henrik Huhtinen, Steve Kelly & Andrew Staples (Finnish), Ferry van het Groenewoud & Joost Baaij (Dutch), Dimitris Xenakis (Greek), Nezih Erkman (Turkish), Jan Simek (Czech), Stefan Billik (Slovak), Wlodek Lapot & Tomek Wozniak (Polish), Laszlo Nemeth (Hungarian), Andrej Zizmond (Slovene), San Sanych Timofeev & Boris Litvinenko (Russian), and Alex Mihaila (Romanian).
Finally, thanks to you for using the program!
*.htm files are now pages on all machines.
This is the index for this Readme. Follow the numbers after each name to find references to that command or concept. Note that families of commands are indexed under the second part of the name: for example, HOSTEXCLUDE is under *EXCLUDE, not under HOST. This index includes all of analog's configuration commands: if a command you used in previous versions is not here, see the page on Upgrading from earlier versions.
Acknowledgements [1]
Addresses, numerical [1]
*ALIAS [1]
Aliases [1]
ALLBACK [1]
ALLGRAPH [1]
analhead.h [1]
analog.cfg [1][2][3][4]
Announcements [1]
ARGSEXCLUDE [1]
*ARGSFLOOR [1]
ARGSINCLUDE [1]
*ARGSSORTBY [1]
*BACK [1]
Bar charts [1]
BASEURL [1]
Basic commands [1]
Broken pipe [1][2]
BROW* commands - see under second part of name
BROWSER [1]
Browser Report [1][2][3]
Browser Summary [1][2]
BROWSUM* commands - see under second part of name
Bugs, reporting [1]
Bytes, how displayed [1]
Cache files [1]
CACHEOUTFILE [1]
CACHEFILE [1]
CASE [1]
CGI program [1]
Colours [1]
*COLS [1][2]
Command line arguments [1][2][3][4]